1 MVP


  1. Load the housing_prices.csv data set and undertake an initial exploration of the data. You will find details on the data set on the relevant Kaggle page
house <- read_csv("data/housing_prices.csv")
library(GGally)
library(tidyverse)
glimpse(house)
## Rows: 19,675
## Columns: 10
## $ longitude          <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.2…
## $ latitude           <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37…
## $ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52,…
## $ total_rooms        <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555,…
## $ total_bedrooms     <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, …
## $ population         <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 15…
## $ households         <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, …
## $ median_income      <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6…
## $ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299…
## $ ocean_proximity    <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NE…


  1. We expect the total_rooms of houses to be strongly correlated with total_bedrooms. Use ggpairs() to investigate correlations between these two variables.
ggpairs(house[,c("total_rooms", "total_bedrooms")])
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 200 rows containing missing values
## Warning: Removed 200 rows containing missing values (geom_point).
## Warning: Removed 200 rows containing non-finite values (stat_density).


  1. So, we do find significant correlations. Let’s drop total_bedrooms from the dataset, and use only total_rooms going forward.
house_tidy <- house %>%
  select(-total_bedrooms)

house_tidy


  1. We are interested in developing a regression model for the median_house_value of a house in terms of the possible predictor variables in the dataset.
  1. Use ggpairs() to investigate correlations between median_house_value and the predictors (this may take a while to run, don’t worry, make coffee or something).
ggpairs(house_tidy)

median_income is strongly correlated with median_house_value. This is followed by a positive correlation between median_house_value and total_bedrooms and a negative correlation between median_house_value and latitude. Furthermore, The histograms of median_house_value split by ocean_proxmityalso show some variation.

  1. Perform further ggplot visualisations of any significant correlations you find.
house_tidy %>%
  ggplot(aes(x = median_income, y = median_house_value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

house_tidy %>%
  ggplot(aes(x = total_rooms, y = median_house_value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

house_tidy %>%
  ggplot(aes(x = latitude, y = median_house_value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

house_tidy %>%
  ggplot(aes(x = ocean_proximity, y = median_house_value)) +
  geom_boxplot()


  1. Shortly we may try a regression model to fit the categorical predictor ocean_proximity. Investigate the level of ocean_proximity predictors. How many dummy variables do you expect to get from it?
house_tidy %>% 
  distinct(ocean_proximity)
# expect 4 dummies for ocean_proximity


  1. Start with simple linear regression. Regress median_house_value on median_income and check the regression diagnostics.
mod1 <- lm(median_house_value ~ median_income, data = house_tidy)
par(mfrow = c(2,2))
plot(mod1)

summary(mod1)
## 
## Call:
## lm(formula = median_house_value ~ median_income, data = house_tidy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -513966  -51564  -13979   36549  403935 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    45457.0     1359.0   33.45   <2e-16 ***
## median_income  39987.0      339.9  117.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74870 on 19673 degrees of freedom
## Multiple R-squared:  0.4129, Adjusted R-squared:  0.4129 
## F-statistic: 1.384e+04 on 1 and 19673 DF,  p-value: < 2.2e-16
# the residuals plots looks alright but there is some evidence to indicate skewness.


  1. Add another predictor of your choice. Check your assumptions, diagnostics, and interpret the model.
mod2_proximity <- lm(median_house_value ~ median_income + ocean_proximity, data = house_tidy)
summary(mod2_proximity)
## 
## Call:
## lm(formula = median_house_value ~ median_income + ocean_proximity, 
##     data = house_tidy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -479055  -41086  -11143   27912  376647 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                83581.2     1407.7  59.373  < 2e-16 ***
## median_income              35118.0      304.8 115.223  < 2e-16 ***
## ocean_proximityINLAND     -72313.7     1100.7 -65.695  < 2e-16 ***
## ocean_proximityISLAND     200480.2    29232.9   6.858 7.19e-12 ***
## ocean_proximityNEAR BAY    16629.7     1591.9  10.446  < 2e-16 ***
## ocean_proximityNEAR OCEAN  15552.9     1500.8  10.363  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65340 on 19669 degrees of freedom
## Multiple R-squared:  0.5529, Adjusted R-squared:  0.5528 
## F-statistic:  4865 on 5 and 19669 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mod2_proximity)

# The predictors median income and ocean_proximity explains 55% of the variation in our dataset. In addition, these predictors are also significant. For every increase in median income by 35,118, the median house price increases by 83,581.2 if all other predictors remain the same. If the house is on an island, the median house prices increases by 200,480.2 if all other predictors remain the same.

2 Extension


  1. Try adding an interaction between log(medium_income) and your chosen categorical predictor. Do you think this interaction term is statistically justified?
mod3 <- lm(log(median_house_value) ~ log(median_income) + ocean_proximity + log(median_income):ocean_proximity, data = house_tidy)
summary(mod3)
## 
## Call:
## lm(formula = log(median_house_value) ~ log(median_income) + ocean_proximity + 
##     log(median_income):ocean_proximity, data = house_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26174 -0.21895 -0.02053  0.20040  2.18571 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                  11.502680   0.011572 993.990
## log(median_income)                            0.571391   0.008457  67.567
## ocean_proximityINLAND                        -0.697405   0.015989 -43.617
## ocean_proximityISLAND                         2.032871   1.038587   1.957
## ocean_proximityNEAR BAY                      -0.053793   0.025064  -2.146
## ocean_proximityNEAR OCEAN                    -0.154993   0.023743  -6.528
## log(median_income):ocean_proximityINLAND      0.176592   0.012770  13.829
## log(median_income):ocean_proximityISLAND     -1.277647   1.028706  -1.242
## log(median_income):ocean_proximityNEAR BAY    0.078022   0.018579   4.200
## log(median_income):ocean_proximityNEAR OCEAN  0.153048   0.018212   8.404
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## log(median_income)                            < 2e-16 ***
## ocean_proximityINLAND                         < 2e-16 ***
## ocean_proximityISLAND                          0.0503 .  
## ocean_proximityNEAR BAY                        0.0319 *  
## ocean_proximityNEAR OCEAN                    6.83e-11 ***
## log(median_income):ocean_proximityINLAND      < 2e-16 ***
## log(median_income):ocean_proximityISLAND       0.2143    
## log(median_income):ocean_proximityNEAR BAY   2.69e-05 ***
## log(median_income):ocean_proximityNEAR OCEAN  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3345 on 19665 degrees of freedom
## Multiple R-squared:  0.6068, Adjusted R-squared:  0.6066 
## F-statistic:  3372 on 9 and 19665 DF,  p-value: < 2.2e-16
# obtain only a small improvement in r^2 from the interaction. 
# we'll see shortly that the correct test would be an anova() comparing both models
anova(mod2_proximity, mod3)
## Warning in anova.lmlist(object, ...): models with response
## '"log(median_house_value)"' removed because response differs from model 1
# the small p-value suggests interaction is statistically significant, but the effect is small.


  1. Find and plot an appropriate visualisation to show the effect of this interaction
house_tidy %>%
  ggplot(aes(x = log(median_income),
             y = log(median_house_value), 
             colour = ocean_proximity)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ ocean_proximity)
## `geom_smooth()` using formula 'y ~ x'

# not much evidence that the gradient of the line varies significantly with ocean_proximity. Island so a change in the gradient however, there is not enough data points to make any conclusions.